The Coalescent

Eric C. Anderson

Conservation Genomics Workshop, Monday August 26, 2024

Overview

Goal: Introduce the Coalescent and convey why it is a crucial model for understanding genetic variation

Outline:

  • Deriving the coalescent from the Wright-Fisher model
    • R-notebook interlude: the exponential and geometric distributions
  • Expected properties of coalescent trees
    • Expected interval times
    • Expected total branch lengths
  • Shapes of the coalescent under different demographic scenarios
    • Hands-on: simulating and visualizing coalescent trees
  • Mutations on the coalescent
    • Group activity: “Find your branch”
  • The site frequency spectrum
    • Hands-on: simulating 1-D SFS from different demographic scenarios

The Wright Fisher model

A simple model for the random process of genes getting from one generation to the next in a population

Assumptions

  • Population of constant size
  • No selection
  • Haploid organisms (diploids are treated as \(2N\) haploids)
  • No sexes
  • Next generation is obtained by sampling with replacement from amongst the gene copies of the previous generation
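The sampling step in the last assumption can be sketched in a few lines of Python. This is only an illustrative sketch, not the code behind the figures: the function names, the seed, and the choice of 80 gene copies (40 diploids) for 10 generations are assumptions made for the example.

```python
import random

def wright_fisher_generation(parents, rng):
    """Each offspring gene copy picks its parent uniformly at random,
    with replacement, from the previous generation."""
    return [rng.choice(parents) for _ in parents]

def simulate(num_gene_copies, num_generations, seed=1):
    rng = random.Random(seed)
    # Label each founding gene copy by its own index; offspring inherit
    # the label of the copy they descend from.
    generation = list(range(num_gene_copies))
    history = [generation]
    for _ in range(num_generations):
        generation = wright_fisher_generation(generation, rng)
        history.append(generation)
    return history

# Hypothetical setting: 40 diploids = 80 gene copies, 10 generations.
history = simulate(num_gene_copies=80, num_generations=10)
surviving = len(set(history[-1]))
print(f"{surviving} of 80 founding gene copies still have descendants")
```

Because sampling is with replacement, some gene copies leave several descendants and others leave none, so the number of founding copies with surviving descendants can only shrink over time.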

The Wright Fisher model. 40 diploids for 10 generations

The Wright Fisher model. Same thing, but re-ordered

The Wright Fisher model. Colored Alleles

The Wright Fisher model. Focus on a sample

A Model With Higher Variance in Reproductive Success

Lineages Coalesce Faster, here

Looks like a Smaller Wright-Fisher population

  • Changes in allele frequencies (or merging of lineages) in that non-WF population happen at the same rate as you would expect in a Wright-Fisher population of smaller size.

  • The effective size of a real population (\(N_e\)) is the size of an ideal (i.e., Wright-Fisher) population that shows “genetic behavior” similar to that of the real population.

  • So, results we obtain with the W-F model can be applied to real populations by using their effective size.

Lineages in a Bigger Population for more Generations

The Coalescent Describes The Random Process of Lineages Finding Common Ancestors

  • Focuses only on the properties of sampled genes/sequences
  • Need not consider all the grey individuals (great for simulation)
  • Particularly useful for sequence data
    • So, think of each of these colored balls as a segment of DNA being copied and handed down from one generation to the next.
  • Is pretty easy to derive from the Wright-Fisher model
  • Ultimately, understanding the coalescent helps you to understand how different demographic or evolutionary effects will change the genetic data you expect to see from populations that you study

Let’s derive the coalescent from the W-F model

  • Start small—focus first on a sample of two gene copies:
  • Consider two gene copies and trace their lineages backwards in time:
  • Terminology: sampled gene copies in the present become lineages that travel through gene copies present in previous generations
  • Two lineages coalesce in the generation that their common ancestor lived in.

Simple probabilities

Let there be \(2N\) haploids (\(N\) diploids) in a Wright-Fisher population.

The probability that two lineages coalesce (arose from a common ancestor) one generation in the past is: \[ \frac{1}{2N} \] The probability that they don’t coalesce one generation in the past is simply: \[ 1 - \frac{1}{2N} \] So, the probability that two lineages first coalesce exactly \(t\) generations in the past (no coalescence in the first \(t-1\) generations, then coalescence in generation \(t\)) is \[ \biggl(1 - \frac{1}{2N}\biggr)^{t - 1}\biggl(\frac{1}{2N}\biggr) \]

This is a geometric distribution.

The geometric distribution is well-approximated by an exponential distribution with the same mean
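A quick numerical check of this approximation, as a minimal sketch: it compares the cumulative probability of coalescence by generation \(t\) under the geometric distribution above with that of an exponential distribution with mean \(2N\). The value \(2N = 2000\) and the function names are assumptions chosen for illustration.

```python
import math

def geometric_pmf(t, two_n):
    """P(two lineages first coalesce exactly t generations back)."""
    p = 1.0 / two_n
    return (1.0 - p) ** (t - 1) * p

def geometric_cdf(t, two_n):
    """P(two lineages have coalesced within t generations)."""
    return 1.0 - (1.0 - 1.0 / two_n) ** t

def exponential_cdf(t, two_n):
    """CDF of an exponential distribution with mean 2N."""
    return 1.0 - math.exp(-t / two_n)

two_n = 2000  # hypothetical: N = 1000 diploids

# Compare the two CDFs at a few time points.
for t in (100, 1000, 2000, 5000):
    print(t, round(geometric_cdf(t, two_n), 4), round(exponential_cdf(t, two_n), 4))

# The mean of the geometric distribution, summed far into the tail.
geometric_mean = sum(t * geometric_pmf(t, two_n) for t in range(1, 200 * two_n))
print("geometric mean:", round(geometric_mean, 3), "vs 2N =", two_n)

# Largest gap between the two CDFs over the first 10 * 2N generations.
max_gap = max(abs(geometric_cdf(t, two_n) - exponential_cdf(t, two_n))
              for t in range(1, 10 * two_n))
print("largest CDF difference:", max_gap)
```

The gap between the two curves shrinks as \(2N\) grows, which is why the continuous-time exponential is used in place of the discrete geometric for populations of realistic size.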

With \(k>2\) lineages, we wait until the first pair coalesces, and then repeat with one fewer lineage

  • With \(k\) lineages there are \(\frac{k(k-1)}{2}\) pairs of lineages
  • Each pair of lineages is independently waiting to coalesce as we go back in time.

\[ \circ~~~~~~~~~~~~ \circ~~~~~~~~~~~~ \circ~~~~~~~~~~~~ \circ~~~~~~~~~~~~ \circ~~~~~~~~~~~~ \circ~~~~~~~~~~~~ \circ~~~~~~~~~~~~ \circ~~~~~~~~~~~~ \circ~~~~~~~~~~~~ \circ \]

The time to the coalescence of the first coalescing pair has an exponential distribution with mean: \[ 2N\frac{2}{k(k-1)} = \frac{4N}{k(k-1)} \] After that first pair coalesces, the two lineages involved become a single lineage, and the process then waits for the first coalescence among the remaining \(k-1\) lineages.

And so forth.
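The procedure just described can be sketched as a short simulation. This is a minimal sketch under the exponential approximation, with \(n = 10\), \(N = 1000\), the seed, and the function name chosen only for illustration: for each \(k\) from \(n\) down to 2 it draws the waiting time with \(k\) lineages, then compares the simulated averages with \(4N/(k(k-1))\).

```python
import random

def simulate_interval_times(n, N, rng):
    """Return {k: T_k}: the time (in generations) during which there are
    k lineages, for k = n, n-1, ..., 2, in a population of N diploids."""
    times = {}
    for k in range(n, 1, -1):
        num_pairs = k * (k - 1) / 2
        # Each pair coalesces at rate 1/(2N), so the first coalescence
        # among all pairs happens at rate num_pairs / (2N).
        rate = num_pairs / (2 * N)
        times[k] = rng.expovariate(rate)
    return times

rng = random.Random(42)
n, N, reps = 10, 1000, 20000  # hypothetical values

mean_T = {k: 0.0 for k in range(2, n + 1)}
for _ in range(reps):
    for k, t in simulate_interval_times(n, N, rng).items():
        mean_T[k] += t / reps

for k in range(n, 1, -1):
    print(f"k={k:2d}  simulated mean={mean_T[k]:8.1f}  4N/(k(k-1))={4 * N / (k * (k - 1)):8.1f}")
```

Only the waiting times are simulated here; choosing which pair coalesces at each step (uniformly at random among the pairs) would be needed to build the tree itself.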

The last two extant lineages coalesce into the “most recent common ancestor” (MRCA) of the sample

The Anatomy of a Coalescent Tree

  • Let \(T_k\) be the length of time (number of generations) during which there are \(k\) extant lineages in the sample.
  • To the left, \(T_{10}\) is the time until the first pair of sampled gene copies coalesces.
  • \(T_\mathrm{MRCA}\) is the time to the most recent common ancestor.
    • i.e., the time it takes for all the lineages to have coalesced
  • Recall: \[ \mathbb{E}T_k = \frac{4N}{k(k-1)} \]

Vertical Lines are Generations

  • I like to think of the vertical lines (the branches) in the coalescent as strings of beads—each bead is a generation, and there can be a lot of them.

  • Each generation = a meiosis in the lineage

  • What can happen during meiosis? (mutation!)

  • Neutral mutations do not affect the shape of the tree

  • This is the neutral coalescent. (Selection renders things much more difficult.)

Expected Properties of the Coalescent with \(n\) tips

Expected time to the MRCA

\[ \begin{aligned} \mathbb{E}T_\mathrm{MRCA} &= \mathbb{E}\biggl[T_2 + T_3 + \cdots + T_n\biggr] \\ &= \sum_{k=2}^n \frac{4N}{k(k-1)} \\ &= 4N \sum_{k=2}^n \biggl( \frac{1}{k-1} - \frac{1}{k}\biggr) \\ &= 4N\biggl(1 - \frac{1}{n}\biggr) \end{aligned} \]

  • Wow! Even with a huge sample of gene copies (large \(n\)), the expected time to the MRCA is no more than twice that of a sample of 2 gene copies.
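The telescoping sum can be checked with exact arithmetic. The sketch below (with a hypothetical \(N = 1000\) and an illustrative function name) sums \(4N/(k(k-1))\) directly, confirms it equals \(4N(1 - 1/n)\), and reports the ratio to the expectation for a sample of two.

```python
from fractions import Fraction

def expected_tmrca(n, N):
    """Sum of E[T_k] = 4N / (k(k-1)) for k = 2..n, as an exact fraction."""
    return sum(Fraction(4 * N, k * (k - 1)) for k in range(2, n + 1))

N = 1000  # hypothetical number of diploids
ratios = {}
for n in (2, 10, 100, 1000):
    direct_sum = expected_tmrca(n, N)
    closed_form = 4 * N * (1 - Fraction(1, n))
    assert direct_sum == closed_form  # the telescoping identity
    ratios[n] = direct_sum / expected_tmrca(2, N)
    print(f"n={n:5d}  E[T_MRCA]={float(direct_sum):8.1f}  ratio to n=2: {float(ratios[n]):.3f}")
```

The ratio is \(2(1 - 1/n)\), so it approaches 2 as \(n\) grows but never reaches it.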

Expected Properties of the Coalescent with \(n\) tips

Expected total branch length

\[ \begin{aligned} \mathbb{E}T_\mathrm{Tot} &= \mathbb{E}\biggl[2T_2 + 3T_3 + \cdots + nT_n\biggr] \\ &= \sum_{k=2}^n k\frac{4N}{k(k-1)} \\ &= \sum_{k=2}^n \frac{4N}{(k-1)} \\ &= 4N\sum_{k=1}^{n-1} \frac{1}{k} \end{aligned} \]

This is the expected number of opportunities (generations/meioses) for mutations to occur.
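As a small numerical illustration of this result (a sketch with a hypothetical \(N = 1000\); the function names are made up for the example), the code below evaluates the weighted sum \(\sum_k k\,\mathbb{E}T_k\) exactly and compares it with \(4N\) times the harmonic sum.

```python
from fractions import Fraction

def expected_total_branch_length(n, N):
    """Sum of k * E[T_k] = k * 4N / (k(k-1)) for k = 2..n."""
    return sum(k * Fraction(4 * N, k * (k - 1)) for k in range(2, n + 1))

def harmonic_sum(m):
    """1 + 1/2 + ... + 1/m as an exact fraction."""
    return sum(Fraction(1, j) for j in range(1, m + 1))

N = 1000  # hypothetical number of diploids
totals = {}
for n in (2, 3, 10, 100):
    totals[n] = expected_total_branch_length(n, N)
    assert totals[n] == 4 * N * harmonic_sum(n - 1)
    print(f"n={n:4d}  E[T_Tot]={float(totals[n]):10.1f} generations")
```

Because the harmonic sum grows only logarithmically, each additional sampled gene copy adds less and less expected branch length on which new mutations can arise.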

The Neutral Coalescent with Varying Demography

We have already noted that non-neutral mutations violate the assumptions of the coalescent.

However, many demographic scenarios can be accommodated in the coalescent framework. E.g.:

  • Population growth / decline
  • Population structure:
    • Migration
    • Population splitting
    • Population joining

Each of these scenarios affects the shape of the coalescent tree.

How Might Population Size Changes Affect the Tree?

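[Figure: example coalescent tree for a sample of 10 gene copies; tips labeled 2, 5, 1, 4, 10, 3, 6, 8, 7, 9 from left to right]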

  • Imagine that the population was larger in the past. How do you expect that to affect \(T_{10}\)? What about \(T_2\)?

  • Now, what if the population was smaller in the past?

  • We have a short hands-on that lets us simulate, and then visualize, coalescent trees under different demographic scenarios.

A link to it can be found at: https://eriqande.github.io/coalescent-hands-on/

You should have already done the commands in 000-prep-for-coalescent.nb.html, and now we will be playing with 002-the-coalescent-and-demographic-scenarios.nb.html.
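Separately from the notebooks, the questions above can also be explored with a rough Python sketch. It is not the hands-on code: it assumes a simple two-epoch model (size `N_recent` from the present back to `t_change` generations ago, and `N_ancient` before that), and all parameter values and function names are hypothetical. It relies on the memoryless property of the exponential: if no coalescence occurs before the size change, the waiting time is redrawn at the boundary using the older population size.

```python
import random

def simulate_two_epoch(n, N_recent, N_ancient, t_change, rng):
    """Return {k: T_k} in generations for a two-epoch population."""
    times = {}
    now = 0.0  # generations before the present
    for k in range(n, 1, -1):
        num_pairs = k * (k - 1) / 2
        start = now
        if now < t_change:
            wait = rng.expovariate(num_pairs / (2 * N_recent))
            if now + wait <= t_change:
                now += wait
            else:
                # No coalescence in the recent epoch: restart the clock
                # at the boundary with the ancient population size.
                now = t_change + rng.expovariate(num_pairs / (2 * N_ancient))
        else:
            now += rng.expovariate(num_pairs / (2 * N_ancient))
        times[k] = now - start
    return times

def mean_times(n, N_recent, N_ancient, t_change, reps, seed):
    rng = random.Random(seed)
    totals = {k: 0.0 for k in range(2, n + 1)}
    for _ in range(reps):
        sim = simulate_two_epoch(n, N_recent, N_ancient, t_change, rng)
        for k, t in sim.items():
            totals[k] += t
    return {k: total / reps for k, total in totals.items()}

# Hypothetical scenarios: (N_recent, N_ancient), size change 500 generations ago.
scenarios = {
    "constant": (1000, 1000),
    "larger in the past": (1000, 10000),
    "smaller in the past": (1000, 100),
}
results = {}
for name, (N_recent, N_ancient) in scenarios.items():
    results[name] = mean_times(10, N_recent, N_ancient, t_change=500, reps=5000, seed=7)
    print(f"{name:20s} mean T_10 = {results[name][10]:7.1f}   mean T_2 = {results[name][2]:9.1f}")
```

Comparing the mean \(T_{10}\) and mean \(T_2\) across the three scenarios shows which parts of the tree are stretched or compressed by a size change.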

Mutations Occur During Meioses

And they are inherited by the descendants

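[Figure: 10-tip coalescent tree (tips labeled 2, 5, 1, 4, 10, 3, 6, 8, 7, 9) with mutations marked by arrows on the branches]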

  • The little arrows indicate generic mutations
  • The colored balls at the bottom indicate the sampled gene copies that inherit the mutations
  • Samples can inherit multiple mutations

The Infinite Sites Model

Every mutation is a different SNP

The Scaled Mutation Parameter \(\theta = 4N\mu L\)

  • When we talk about mutation rates, it is common to scale them by \(4N\) or \(2N\).
  • ms scales them by \(4N\). Why? And what does this mean?
  • Let:
    • \(\mu\) be the per-base mutation rate per-generation
    • \(L\) be the number of bases in a segment of DNA
  • Then \(\mu L\) is (approximately, when \(\mu L\) is small) the probability of a mutation each generation somewhere in the sequence.
  • Recall that \(2N\) is the expected time for a pair of sampled genes (sequences) to coalesce in a population of \(N\) diploids.
    • i.e. \(\mathbb{E}T_2 = 2N\).
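  • Thus, because the two lineages separating a pair of sequences together span \(2 \times 2N = 4N\) generations on average, the expected number of bases that differ between a pair of sequences is \(4N\mu L\).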
  • That is the expected number of heterozygous sites between a pair of sequences.
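The argument above can be made concrete with a small sketch. The values of \(N\), \(\mu\), and \(L\) below are hypothetical, and the simulation is one simple way to check the expectation rather than how ms works internally: it draws \(T_2\) from an exponential with mean \(2N\), then drops a Poisson number of mutations (mean \(2T_2\mu L\)) on the two lineages, treating every mutation as a new difference (infinite sites).

```python
import math
import random

def draw_poisson(mean, rng):
    """Poisson variate via Knuth's multiplication method (fine for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

# Hypothetical values for illustration only.
N = 10_000    # diploid population size
mu = 1e-8     # per-base mutation rate per generation
L = 10_000    # number of bases in the segment

theta = 4 * N * mu * L
print("theta = 4 * N * mu * L =", theta)

rng = random.Random(2024)
reps = 20_000
total_differences = 0
for _ in range(reps):
    t2 = rng.expovariate(1 / (2 * N))  # coalescence time of the pair, mean 2N
    # Two lineages of length t2 each, mutating at rate mu * L per generation.
    total_differences += draw_poisson(2 * t2 * mu * L, rng)

mean_differences = total_differences / reps
print("simulated mean pairwise differences:", round(mean_differences, 3))
```

With these values the simulated mean number of pairwise differences should sit close to \(\theta\), up to Monte Carlo noise.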

The Site Frequency Spectrum (SFS)

  • If you have \(n\) samples (tips on the tree), then any mutation you see in your sample will occur in some number of samples between 1 and \(n - 1\)

  • Let \(S_n^i\) denote the number of mutations that were inherited by \(i\) haploid sequences out of a sample of \(n\) of them.

  • This is a common way to summarise data in conservation genetics and contains a lot of information about the history of the sample

The SFS on our Tiny Example

Only three SNPs so not very representative!

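[Figure: the 10-tip coalescent tree with three mutations (SNP 1: A→G, SNP 2: C→T, SNP 3: G→C); the three-site haplotypes at the tips, from left to right, are GCG, GCG, ACG, ATG, ATG, ATG, ATC, ATC, ATC, ATC]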

  • At the first SNP the mutation is seen in 2 out of 10 samples
  • At the 2nd SNP, 7 out of 10 samples.
  • At 3rd SNP, 4 out of 10 samples.

So, from this silly, tiny example, we have: \[ S_{10}^2 = 1~~~~~S_{10}^7 = 1~~~~~~S_{10}^4 = 1 \] with all other \(S_n^i = 0\).
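The tally above can be sketched in code. The 0/1 haplotypes below (1 = carries the mutation) are an assumed encoding that merely reproduces the counts stated for the three SNPs (2, 7, and 4 out of 10); the function name is illustrative.

```python
from collections import Counter

def site_frequency_spectrum(haplotypes):
    """Return [S_n^1, ..., S_n^(n-1)] from equal-length strings of 0s and 1s,
    one string per sampled gene copy, one character per SNP."""
    n = len(haplotypes)
    num_sites = len(haplotypes[0])
    # For each SNP, count how many samples carry the mutation.
    derived_counts = [sum(int(h[site]) for h in haplotypes)
                      for site in range(num_sites)]
    tally = Counter(derived_counts)
    return [tally.get(i, 0) for i in range(1, n)]

# Assumed encoding of the tiny example: columns are SNP 1, SNP 2, SNP 3.
haplotypes = ["100", "100", "000", "010", "010",
              "010", "011", "011", "011", "011"]

sfs = site_frequency_spectrum(haplotypes)
for i, count in enumerate(sfs, start=1):
    print(f"S_10^{i} = {count}")
```

Entries 2, 4, and 7 of the resulting spectrum are 1 and all others are 0, matching the values written out above.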

The Expected SFS of the Neutral Coalescent

In a neutral coalescent with no population structure and constant population size: \[ \mathbb{E}S_n^i = \frac{\theta}{i}~~~~\mathrm{for}~1\leq i < n \] The key here is the shape. What does it look like? Here it is for \(n=50\) and arbitrary \(\theta\):
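The same shape can be tabulated numerically. This minimal sketch evaluates \(\theta / i\) for \(n = 50\), with \(\theta = 1\) as an arbitrary assumed value, and prints a crude text bar for the first few classes; the function name is illustrative.

```python
def expected_sfs(n, theta):
    """Expected SFS under the constant-size neutral coalescent:
    E[S_n^i] = theta / i for i = 1, ..., n - 1."""
    return [theta / i for i in range(1, n)]

n, theta = 50, 1.0  # theta is arbitrary; only the shape matters here
sfs = expected_sfs(n, theta)

# Share of all expected mutations that fall in each frequency class.
total = sum(sfs)
proportions = [value / total for value in sfs]

for i in (1, 2, 3, 4, 5, 10, 25, 49):
    bar = "#" * round(100 * proportions[i - 1])
    print(f"i={i:2d}  E[S]={sfs[i - 1]:.3f}  share={proportions[i - 1]:.3f}  {bar}")
```

The total of the expected SFS is \(\theta\) times the harmonic sum \(\sum_{i=1}^{n-1} 1/i\), which mirrors the expected total branch length derived earlier: mutations land on the tree in proportion to branch length.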

The site frequency spectrum can be observed

  • You can estimate the SFS from genomic data

  • By comparing it to the SFS expected under different scenarios, you can attempt to infer the demographic history of a population

  • We won’t go into it, but many programs can do this:

    • fastsimcoal
    • dadi, i.e., \(\partial a \partial i\)
    • moments
  • For now, we will simulate some one-dimensional SFS from the coalescent to gain some intuition about how the shape of the coalescent affects the one-dimensional SFS.

  • https://eriqande.github.io/coalescent-hands-on/004-one-dimensional-SFS.nb.html

Wrap Up

We have only scratched the surface here, but I hope that some of y’all are convinced that the coalescent is an important topic to study more for understanding methodology in conservation genomics.